NSF PAR Search | NSF Public Access Repository


EdgeWeaver: Accelerating IoT Application Development Across Edge-Cloud Continuum

Lertpongrujikorn, Pawissanutt; Kwon, Juahn; Nguyen, Hai Duc; Amini Salehi, Mohsen (December 2025, IEEE)

Free, publicly-accessible full text available December 22, 2026
Resilient execution of distributed X-ray image analysis workflows

https://doi.org/10.3389/fhpcp.2025.1550855

Nguyen, Hai Duc; Bicer, Tekin; Nicolae, Bogdan; Kettimuthu, Rajkumar; Huerta, E A; Foster, Ian T (June 2025, Frontiers in High Performance Computing)

Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact, by up to 500×, while maintaining negligible overhead (<3%).
Free, publicly-accessible full text available June 6, 2026
D-Rex: Heterogeneity-Aware Reliability Framework and Adaptive Algorithms for Distributed Storage

https://doi.org/10.1145/3721145.3730412

Gonthier, Maxime; Sanchez-Gallegos, Dante D; Pan, Haochen; Nicolae, Bogdan; Zhou, Sicheng; Nguyen, Hai Duc; Hayot-Sasson, Valerie; Pauloski, Greg; Carretero, Jesus; Chard, Kyle; et al (June 2025, ACM)

Free, publicly-accessible full text available June 8, 2026
Streamlining Cloud-Native Application Development and Deployment with Robust Encapsulation

https://doi.org/10.1145/3698038.3698552

Lertpongrujikorn, Pawissanutt; Nguyen, Hai Duc; Salehi, Mohsen Amini (November 2024, ACM)

Full Text Available
A Foundation for Real-time Applications on Function-as-a-Service

Nguyen, Hai Duc; Chien, Andrew A. (February 2024, A Foundation for Real-time Applications on Function-as-a-Service)

The Serverless (or Function-as-a-Service) compute model enables new applications with dynamic scaling. However, all current Serverless systems are best-effort, and as we prove, this means they cannot guarantee hard real-time deadlines, rendering them unsuitable for such real-time applications. We analyze a proposed extension of the Serverless model, called Real-time Serverless, that adds a guaranteed invocation rate. This approach aims to meet real-time deadlines with dynamically allocated function invocations. We first prove that the Serverless model does not support real-time guarantees. Next, we analyze Real-time Serverless, showing it can guarantee application real-time deadlines for rate-monotonic real-time workloads. Further, we derive bounds on the required invocation rate to meet any set of workload runtimes and periods. Subsequently, we explore an application technique, pre-invocation, and show that it can reduce the required guaranteed invocation rate. We derive bounds for the feasible rate guarantee reduction and the corresponding overhead in wasted compute resources. Finally, we apply the theoretical results to improve the experience quality of a distributed virtual reality/augmented reality application, as well as to simplify the application design and resource management.
Full Text Available
Storm-RTS: Stream Processing with Stable Performance for Multi-Cloud and Cloud-edge

https://doi.org/10.1109/CLOUD60044.2023.00015

Nguyen, Hai Duc; Chien, Andrew A. (July 2023, IEEE)

Stream Processing Engines (SPEs) traditionally deploy applications on a set of shared workers (e.g., threads, processes, or containers) requiring complex performance management by SPEs and application developers. We explore a new approach that replaces workers with Rate-based Abstract Machines (RBAMs). This allows SPEs to translate stream operations into FaaS invocations, and exploit guaranteed invocation rates to manage performance. This approach enables SPE applications to achieve transparent and predictable performance. We realize the approach in the Storm-RTS system. Exploring 36 stream processing scenarios over 5 different hardware configurations, we demonstrate several key advantages. First, Storm-RTS provides stable application performance and can enable flexible reconfiguration across cloud resource configurations. Second, SPEs built on RBAM can be resource-efficient and scalable. Finally, Storm-RTS allows the stream-processing paradigm to be extended from the cloud to the edge, using its performance stability to hide edge heterogeneity and resource competition. An experiment with 4 cloud and edge sites over 300 cores shows how Storm-RTS can support flexible reconfiguration and simple high-level declarative policies that optimize resource cost or other criteria.
Full Text Available
Motivating High Performance Serverless Workloads

Nguyen, Hai Duc; Yang, Zhifei; Chien, Andrew A. (June 2021, The 1st Workshop on High Performance Serverless Computing)
Foster, Ian; Chard, Kyle; Babuji, Yadu (Ed.)
The historical motivation for serverless comes from internet-of-things, smartphone client-server applications, and the objectives of simplifying programming (no provisioning) and scale-down (pay-for-use). These applications are generally low-performance and best-effort. However, the serverless model enables flexible software architectures suitable for a wide range of applications that demand high and guaranteed performance. We have studied three such applications: scientific data streaming, virtual/augmented reality, and document annotation. We describe how each can be cast in a serverless software architecture and how the application performance requirements translate into high performance requirements (invocation rate, low and predictable latency) for the underlying serverless system implementation. These applications can require invocation rates as high as millions per second (40 MHz) and latency deadlines below a microsecond (300 ns), and furthermore require performance predictability. All of these capabilities are far in excess of today's commercial serverless offerings and represent interesting research challenges.
Full Text Available
Real-time Serverless: Enabling Application Performance Guarantees

https://doi.org/10.1145/3366623.3368133

Nguyen, Hai Duc; Zhang, Chaojie; Xiao, Zhujun; Chien, Andrew A. (December 2019, Proceedings of the International Workshop on Serverless Computing)

Today's serverless provides "function-as-a-service" with dynamic scaling and fine-grained resource charging, enabling new cloud applications. Serverless functions are invoked as a best-effort service. We propose an extension to serverless, called real-time serverless, that provides an invocation rate guarantee: a service-level objective (SLO) specified by the application and delivered by the underlying implementation. Real-time serverless allows applications to guarantee real-time performance. We study real-time serverless behavior analytically and empirically to characterize its ability to support bursty, real-time cloud and edge applications efficiently. Finally, we use a case study, traffic monitoring, to illustrate the use and benefits of real-time serverless on our prototype implementation.
Full Text Available
